image from Unsplash
image from Unsplash

Problem Identification and Research Question :

In the current landscape of the film industry, data analysis has become an essential tool for grasping market trends and predicting the success of movies. In this data analysis endeavor, our aim is to delve into an extensive film database sourced from The Movie Database (TMDB) to pinpoint and categorize the standout films of the past five years.

This data analysis project is geared towards offering a comprehensive and updated perspective on the cinematic scene. Leveraging data analysis techniques in R, our objective is to identify the top-performing movies from the last five years and gain insights into the factors underpinning their success.

The Data :

In this data analysis project, we will utilize a comprehensive database containing over 1 million movies (1008166), sourced from Full Movies TMDB 2024. The dataset comprises the following variables:

Variable Description
id Unique identifier for each movie
vote_count Number of votes received by the movie
revenue Revenue generated by the movie
backdrop_path Path to the backdrop image of the movie
imdb_id IMDb identifier of the movie
overview Summary or plot synopsis of the movie
tagline Promotional tagline of the movie
production_countries Countries where the movie was produced
title Title of the movie
status Production status of the movie
runtime Duration in minutes of the movie
budget Production budget of the movie
original_language Original language of the movie
popularity Popularity index of the movie
genres Genres of the movie
spoken_languages Spoken languages in the movie
vote_average Average rating of the movie by users
release_date Release date of the movie
adult Indicator of whether the movie is for adults
homepage Official homepage of the movie
original_title Original title of the movie
poster_path Path to the poster image of the movie
production_companies Production companies involved in the movie
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(psych)
## Warning: package 'psych' was built under R version 4.2.3
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
setwd("~/Proyectos R/archive (5)")


TMDB_data <- read.csv("TMDB_movie_dataset_v11.csv")

Data Cleaning and Preprocessing :

View Data

head(TMDB_data, 10)

View Structure

glimpse(TMDB_data)
## Rows: 1,008,166
## Columns: 23
## $ id                   <int> 27205, 157336, 155, 19995, 24428, 293660, 299536,…
## $ title                <chr> "Inception", "Interstellar", "The Dark Knight", "…
## $ vote_average         <dbl> 8.364, 8.417, 8.512, 7.573, 7.710, 7.606, 8.255, …
## $ vote_count           <int> 34495, 32571, 30619, 29815, 29166, 28894, 27713, …
## $ status               <chr> "Released", "Released", "Released", "Released", "…
## $ release_date         <chr> "2010-07-15", "2014-11-05", "2008-07-16", "2009-1…
## $ revenue              <dbl> 825532764, 701729206, 1004558444, 2923706026, 151…
## $ runtime              <int> 148, 169, 152, 162, 143, 108, 149, 139, 121, 154,…
## $ adult                <chr> "False", "False", "False", "False", "False", "Fal…
## $ backdrop_path        <chr> "/8ZTVqvKDQ8emSGUEMjsS4yHAwrp.jpg", "/pbrkL804c8y…
## $ budget               <int> 160000000, 165000000, 185000000, 237000000, 22000…
## $ homepage             <chr> "https://www.warnerbros.com/movies/inception", "h…
## $ imdb_id              <chr> "tt1375666", "tt0816692", "tt0468569", "tt0499549…
## $ original_language    <chr> "en", "en", "en", "en", "en", "en", "en", "en", "…
## $ original_title       <chr> "Inception", "Interstellar", "The Dark Knight", "…
## $ overview             <chr> "Cobb, a skilled thief who commits corporate espi…
## $ popularity           <dbl> 83.952, 140.241, 130.643, 79.932, 98.082, 72.735,…
## $ poster_path          <chr> "/oYuLEt3zVCKq57qu2F8dT7NIa6f.jpg", "/gEU2QniE6E7…
## $ tagline              <chr> "Your mind is the scene of the crime.", "Mankind …
## $ genres               <chr> "Action, Science Fiction, Adventure", "Adventure,…
## $ production_companies <chr> "Legendary Pictures, Syncopy, Warner Bros. Pictur…
## $ production_countries <chr> "United Kingdom, United States of America", "Unit…
## $ spoken_languages     <chr> "English, French, Japanese, Swahili", "English", …

Upon examination, our database boasts an impressive count of precisely 1,008,166 movies. Furthermore, we’re equipped with a rich array of over 20 variables for comprehensive analysis.

Variable Types Breakdown:

  • chr (Character): Comprising 16 variables

  • int (Integer): Encompassing 4 variables

  • dbl (Double): Incorporating 3 variables

Handling Missing Values :

missing_values <- colSums(is.na(TMDB_data))

tibble(Variables = names(missing_values), missing_values = missing_values)

Great news! Our dataset is clean and free of any missing values. We’re all set to dive into our analysis with confidence!

Handling Outliers

numeric_variable <- TMDB_data %>%
  select_if(is.numeric)

describe(numeric_variable)

Interesting, firstly, we notice that both revenue and runtime have negative values, which is entirely impossible in movies. revenue also has many values equal to 0, which is logical because nowadays a large portion of movies released do not exclusively premiere in big cinemas but also on platforms like Netflix, HBO MAX, Disney+, etc. This makes it very complicated to calculate movie revenues. Additionally, the maximum value of runtime is an enormous number, which also lacks logic.

We also observe:

Let’s attempt to correct these outliers:

Outliers_values <- TMDB_data %>%
  filter( title %in% c("Adventures in Bora Bora", "The Bear") | title == "Untitled" | original_title == "Untitled" | status == "Canceled" | status == "Rumored" | status == "Planned" | revenue < 0 | runtime < 0 )

#checking revenue 
TMDB_data <- TMDB_data %>%
  filter(!revenue < 0, !title %in%  c("Adventures in Bora Bora", "The Bear"))


#checking runtime
TMDB_data <- TMDB_data %>%
  filter(!runtime < 0)

#checking title without names
TMDB_data <- TMDB_data %>%
  filter(!title == "Untitled")

#checking Films canceled, Rumored and Planned
TMDB_data <- TMDB_data %>%
  filter(!status %in% c("Canceled", "Rumored", "Planned"))



Outliers_values

Data Transformation

Previously, we observed a significant number of character-type variables. Let’s proceed by transforming these variables into categorical variables to facilitate subsequent movie analysis.

TMDB_data_pre <- TMDB_data %>%
  select(-release_date) %>%
  mutate_if(is.character, as.factor)

summary(TMDB_data_pre)
##        id               title          vote_average      vote_count      
##  Min.   :      2   Home    :    141   Min.   : 0.000   Min.   :    0.00  
##  1st Qu.: 370952   Alone   :     89   1st Qu.: 0.000   1st Qu.:    0.00  
##  Median : 665900   Mother  :     83   Median : 0.000   Median :    0.00  
##  Mean   : 664662   Love    :     70   Mean   : 2.102   Mean   :   21.39  
##  3rd Qu.: 975640   The Gift:     70   3rd Qu.: 5.000   3rd Qu.:    1.00  
##  Max.   :1260267   The Box :     68   Max.   :10.000   Max.   :34495.00  
##                    (Other) :1001981                                      
##              status          revenue             runtime          adult       
##  In Production  :  5909   Min.   :0.000e+00   Min.   :    0.0   False:908922  
##  Post Production:  5381   1st Qu.:0.000e+00   1st Qu.:    1.0   True : 93580  
##  Released       :991212   Median :0.000e+00   Median :   30.0                 
##                           Mean   :7.388e+05   Mean   :   50.7                 
##                           3rd Qu.:0.000e+00   3rd Qu.:   90.0                 
##                           Max.   :2.930e+09   Max.   :14400.0                 
##                                                                               
##                           backdrop_path        budget         
##                                  :717554   Min.   :        0  
##  None                            :   883   1st Qu.:        0  
##  /3CxwYgqGtJ6UEGfWUT0gMYCIlFP.jpg:   157   Median :        0  
##  /pOXuMdKnWO9hK8drDahJVQxILHx.jpg:    66   Mean   :   289206  
##  /6r2onqJ2S7XhtnU3HbvNmEv8SXK.jpg:    47   3rd Qu.:        0  
##  /60t9ckELUGyinDQNDULFDq2i7u7.jpg:    44   Max.   :900000000  
##  (Other)                         :283751                      
##                            homepage            imdb_id       original_language
##                                :893656             :430416   en     :538915   
##  https://animation.geidai.ac.jp:   147   None      :   832   fr     : 60589   
##  http://www.eldoradofilms.com  :    73   tt13904644:    18   es     : 51724   
##  https://www.yamidouga.com/    :    63   tt26900526:     8   de     : 50974   
##  https://uaebabes.com          :    57   tt23810972:     7   ja     : 43919   
##  http://www.ufc.com            :    52   tt27048168:     7   pt     : 28185   
##  (Other)                       :108454   (Other)   :571214   (Other):228196   
##  original_title                    overview        popularity      
##  Home   :     91                       :182271   Min.   :   0.000  
##  Carmen :     62   Mexican feature film:   912   1st Qu.:   0.600  
##  Limbo  :     57                       :   619   Median :   0.600  
##  Solo   :     56   Plot Unavailable.   :   548   Mean   :   1.346  
##  Alone  :     55   Hong Kong movie     :   450   3rd Qu.:   0.912  
##  The Box:     50   Nikkatsu Roman Porno:   356   Max.   :2994.357  
##  (Other):1002131   (Other)             :817346                     
##                            poster_path                    tagline      
##                                  :276670                      :858506  
##  None                            :   598   English            :   165  
##  /wtoKLMm4UvkwvcSwO3XWcs1gJuF.jpg:    54   animation short    :   104  
##  /cWjdh8VTiizYfQp5m6fJi4PDy8w.jpg:    48   100% Lesbian       :    73  
##  /je3JbUs3OEoYkS6Vd7iv7w6HUPu.jpg:    45   Documentary        :    60  
##  /qpXweJ0Gbl5OmYZqzNWtDJovF8e.jpg:    41   AN INSIDE NO.9 FILM:    46  
##  (Other)                         :725046   (Other)            :143548  
##          genres                production_companies
##             :371589                      :527616   
##  Documentary:124523   Evil Angel         :  2805   
##  Drama      : 99077   ONF | NFB          :  2130   
##  Comedy     : 55629   BBC                :  2025   
##  Animation  : 28941   Metro-Goldwyn-Mayer:  2025   
##  Music      : 23773   Columbia Pictures  :  1927   
##  (Other)    :298970   (Other)            :463974   
##                production_countries    spoken_languages 
##                          :409960               :399804  
##  United States of America:165664    English    :213258  
##  Japan                   : 37093    Japanese   : 36555  
##  Germany                 : 31611    French     : 33940  
##  United Kingdom          : 31258    Spanish    : 33645  
##  France                  : 30628    No Language: 26900  
##  (Other)                 :296288    (Other)    :258400

Excellent, we now have a somewhat clearer overview of the database for our subsequent analysis.

Nevertheless, we don’t require as many variables for our primary goal; let’s choose only the most pertinent variables and data from 2018 onwards.

TMDB_data_pre$release_date <- TMDB_data$release_date

TMDB_data <- TMDB_data_pre

TMDB_data <- TMDB_data %>%
  select(id, title, release_date, status, runtime,popularity, vote_average, vote_count, revenue, budget, production_companies, genres, original_language, adult) %>%
  filter(release_date > 2018)



TMDB_data$release_date <- as.Date(TMDB_data$release_date, format =  "%Y-%m-%d")

TMDB_data$year <- year(TMDB_data$release_date)
TMDB_data$month <- month(TMDB_data$release_date, label = TRUE)


TMDB_data %>%
  filter(vote_average > 7) %>%
  arrange(desc(popularity)) %>%
  head()

We’ve pruned unnecessary variables and filtered our updated database to include only films released after 2018, focusing on the past 5 years.

Let’s keep in mind our goal: to identify the cream of the crop! To do so, we’ll narrow our focus to feature-length films with substantial budgets, relying on popularity and average ratings to pinpoint the top contenders.

TMDB_data <- TMDB_data %>%
  filter(runtime > 60) # para excluir cortometrajes

TMDB_data <- TMDB_data %>% # para excluir la gran mayoria de peliculas con bajo presupesto
  filter(budget > 100000) %>%
  mutate(revenue_in_million = round(revenue/1000000, 3), budget_in_million = round(budget/1000000,3)) 


TMDB_data %>%
  select(-c("revenue", "budget"))
TMDB_data %>%
  arrange(budget_in_million)
TMDB_data <- TMDB_data %>%
  filter(vote_average > 7) %>%
  arrange(desc(popularity))
glimpse(TMDB_data)
## Rows: 738
## Columns: 18
## $ id                   <int> 565770, 980489, 1008042, 385687, 678512, 346698, …
## $ title                <fct> "Blue Beetle", "Gran Turismo", "Talk to Me", "Fas…
## $ release_date         <date> 2023-08-16, 2023-08-09, 2023-07-26, 2023-05-17, …
## $ status               <fct> Released, Released, Released, Released, Released,…
## $ runtime              <int> 128, 135, 95, 142, 131, 114, 102, 118, 124, 127, …
## $ popularity           <dbl> 2994.357, 2680.593, 1458.514, 1175.267, 1111.036,…
## $ vote_average         <dbl> 7.139, 8.068, 7.214, 7.265, 7.973, 7.279, 7.757, …
## $ vote_count           <int> 1023, 702, 973, 3881, 503, 5074, 2467, 56, 21, 32…
## $ revenue              <dbl> 124818235, 114800000, 72600000, 704709660, 212587…
## $ budget               <int> 120000000, 60000000, 4500000, 340000000, 15000000…
## $ production_companies <fct> "Warner Bros. Pictures, The Safran Company, DC Fi…
## $ genres               <fct> "Action, Science Fiction, Adventure", "Action, Dr…
## $ original_language    <fct> en, en, en, en, en, en, en, en, en, en, en, en, e…
## $ adult                <fct> False, False, False, False, False, False, False, …
## $ year                 <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
## $ month                <ord> ago, ago, jul, may, jul, jul, jun, sep, sep, jun,…
## $ revenue_in_million   <dbl> 124.818, 114.800, 72.600, 704.710, 212.587, 1428.…
## $ budget_in_million    <dbl> 120.0, 60.0, 4.5, 340.0, 15.0, 145.0, 200.0, 13.0…

Amazingly, after completing all the data cleaning and transformation, we’re left with just 739 movies out of over 1 million movies.

Data Analysis

With the necessary data at hand, we delve into an in-depth analysis of top movies, examining each variable meticulously.

Analysis

ggplot(TMDB_data, aes(year))+
  geom_histogram(binwidth = 1, fill = "#67C6E3", color = "black") +
  labs(title = "Total Number of Top Movies Released from 2018 to 2024", x = "year", y = "Number of Movies")

ggplot(TMDB_data, aes(x = month)) +
  geom_bar(fill = "#67C6E3") +
  theme(axis.text.x = element_text(angle = 90, vjust = -0.01)) +
  facet_wrap(~year) +
  labs(title = "Total Number of Top Movies Released by Year and Month (2018-2024)", x = "Month", y = "Number of Movies")

ggplot(TMDB_data, aes(runtime)) +
         geom_histogram(fill = "#67C6E3") + 
  labs(title = "Movie Runtime Distribution", x = "Runtime (in minutes)", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

TMDB_data %>%
  filter(runtime > 400)

We observe a predominant trend towards movies with a duration of approximately 100 minutes, translating to around 1 hour and 40 minutes. This duration seems to be the favored choice among filmmakers for captivating storytelling.

ggplot(TMDB_data, aes(popularity)) +
         geom_histogram(fill = "#67C6E3") +
  labs(title = "Movie Popularity Distribution", x = "Popularity", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

TMDB_data %>%
  filter(popularity > 2000)

Upon closer examination of the dataset, it becomes evident that the majority of movies fall into the realm of relative obscurity, marked by a left-skewed distribution in popularity scores. However, two outliers, “Gran Turismo” and “Blue Beetle,” stand out as exceptionally popular among audiences, defying the norm.

ggplot(TMDB_data, aes(vote_average)) +
         geom_histogram(fill = "#67C6E3")+
  labs(title = "Movie Vote Average Distribution", x = "Vote Average", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This histogram exhibits a slight left-skew, with scores predominantly clustered in the lower range, typically between 7 and 8 out of 10.

ggplot(TMDB_data, aes(vote_count)) +
         geom_histogram(fill = "#67C6E3")+
  xlim(0,27714)+
  ylim(0,100) +
  labs(title = "Movie Vote Distribution", x = "Vote", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing missing values (`geom_bar()`).

TMDB_data %>%
  filter(vote_count > 20000)

We observe a left-skewed distribution, similar to that of popularity, indicating that the majority of movies in our dataset receive relatively few votes. This trend suggests that these films may struggle to gain traction due to their limited visibility, making it difficult for TMDB users to discover these lesser-known gems. However, it’s noteworthy to mention the top four highest-rated movies, two of which hail from the Marvel universe:

  • Avengers: Infinity War: Garnering 27,713 votes with an impressive average rating of 8.25

  • Avengers: Endgame: Accumulating 23,857 votes with a closely comparable average rating of 8.26

  • Joker: Amassing 23,425 votes with a commendable average rating of 8.16

  • Black Panther: Securing 21,053 votes with a respectable average rating of 7.39

ggplot(TMDB_data, aes(revenue_in_million)) +
         geom_histogram(fill = "#67C6E3")+
  labs(title = "Movie Revenue Distribution in million", x = "Revenue (in million)", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Similarly, we witness a left-skewed distribution in this histogram, indicative of the complexities associated with precisely measuring the true earnings of films released via online platforms. The prevalence of zero scores within our dataset underscores the inherent difficulty in assessing their financial performance accurately. Additionally, it’s noteworthy that the majority…

ggplot(TMDB_data, aes(budget_in_million)) +
         geom_histogram(fill = "#67C6E3")+
  labs(title = "Movie Budget Distribution in million", x = "Budget (in million)", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

TMDB_data %>%
  filter(budget_in_million > 300)

We continue to notice a left-skewed pattern in the histogram, suggesting that most films have budgets below the one-million-dollar threshold, typically falling within the range of $100,000 to $1,000,000. Notably, the three films commanding the highest budgets are:

  • Avatar: The Way of Water: $460 million

  • Avengers: Endgame: $356 million

  • Fast & Furious X: $340 million

TMDB_data_earnings_losses <- TMDB_data %>%
  filter(revenue_in_million > 0) %>%
  mutate(earnings_and_losses = revenue_in_million-budget_in_million)

TMDB_data_earnings_losses <- TMDB_data_earnings_losses %>%
  mutate(earn_of_lose = ifelse(earnings_and_losses >= 0, "Earnings", "Losses"), year = year(release_date))

TMDB_data_average_gains_and_losses <- TMDB_data_earnings_losses %>%
  group_by(year, earn_of_lose) %>%
  summarize(mean_earn_lose = round(mean(earnings_and_losses), 3))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
ggplot(TMDB_data_average_gains_and_losses, aes(year, mean_earn_lose, fill = earn_of_lose))+
  geom_bar(stat = "identity")+
  scale_fill_manual(values = c("Earnings" = "#67C6E3", "Losses" = "#FF6969")) +
  geom_text(aes(label = mean_earn_lose, y = ifelse(earn_of_lose == "Earnings", mean_earn_lose * 1.05, mean_earn_lose * 1.4)), color = "black") +
  labs(title = "Stacked Bar Chart of Average Gains and Losses by Year (2018-2024)", x = "", y = "mean_(in million)", fill = "earn/lose")

We can discern that 2019 emerged as a standout year, boasting the highest average profits per film, exceeding $232 million. However, it also faced losses averaging around $11 million per film. Undoubtedly, it marked a significant milestone for the cinema industry. Conversely, the subsequent year witnessed a substantial downturn in profits among top-performing movies, with earnings hovering around $77 million and losses escalating to $23 million per film. While this figure may appear substantial to the average viewer, within the cinematic realm, it signifies a considerable setback, particularly considering these films’ high ratings and widespread popularity aimed at societal impact. It’s crucial to acknowledge that 2020 bore the brunt of the COVID-19 pandemic, with global lockdowns likely contributing to the decline in profits for most films.

Following that challenging period, subsequent years showed promise, with profits more than doubling in 2021 compared to 2020, reaching approximately $170 to $210 million.

TOP3_language <- TMDB_data_earnings_losses %>%
  filter(!original_language == "cn") %>%
  group_by(original_language) %>%
  summarize(mean_earnings = mean(earnings_and_losses)) %>%
  arrange(desc(mean_earnings)) %>%
  head(3)


ggplot(TOP3_language, aes(reorder(original_language, mean_earnings), mean_earnings))+
  geom_bar(stat = "identity", fill = "#67C6E3")+
  coord_flip()+
  labs(title = "Top 3 Languages by Average Revenue", x = "", y = "Mean revenue (in million)")

Impressively, Asian films command a substantial presence in the film industry. Specifically, a standout Chinese (zh) production released between 2018 and 2024 boasts an average profit nearing $300 million. This highlights the growing influence and lucrative nature of Asian cinema on a worldwide scale.